WGCNA-Based Identification of Hub Genes and Key Pathways Involved in Nonalcoholic Fatty Liver Disease

Background The morbidity of nonalcoholic fatty liver disease (NAFLD) has been rising, but the pathogenesis of NAFLD is still elusive. This study is aimed at determining NAFLD-related hub genes based on weighted gene coexpression network analysis (WGCNA). Methods GSE126848 dataset based construction of coexpression networks was performed based on WGCNA. Database for Annotation, Visualization, and Integrated Discovery (DAVID) was utilized for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. Hub genes were identified and validated in independent datasets and mouse model. Results We found that the steelblue module was most significantly correlated with NAFLD. Total 15 hub genes (NDUFA9, UQCRQ, NDUFB8, COPS5, RPS17, UBL5, PSMA3, PSMA1, SF3B5, MRPL27, RPL26, PDCD5, PFDN6, SNRPD2, PSMB3) were derived from both the coexpression and PPI networks and considered “true” hub genes. Functional enrichment analysis showed that the hub genes were related to NAFLD pathway and oxidative phosphorylation. Independent dataset-based analysis and the establishment of NAFLD mouse model confirmed the involvement of two hub genes NDUFA9 and UQCRQ in the pathogenesis of NAFLD. Conclusions Oxidative phosphorylation and NAFLD pathway may be crucially involved in the pathogenesis of NAFLD, and two hub genes NDUFA9 and UQCRQ might be diagnostic biomarkers and therapeutic targets for NAFLD.


Introduction
NAFLD is one of the most prevalent chronic liver diseases in the world, affecting approximately 30% of the adult population globally [1]. Based on disease severity, NAFLD is classified as nonalcoholic fatty liver (NAFL) or nonalcoholic steatohepatitis (NASH). Patients with NASH exhibit histological lobular inflammation and hepatocyte ballooning, while NASH may progress to fibrosis faster than NAFL [2]. Patients with NAFLD often develop cirrhosis and numerous liver-related complications. In the USA, NASH is the third leading cause of end-stage liver disease and hepatocellular malignancy [3].
NAFLD has drawn remarkable attention due to its widespread prevalence and socioeconomic burden. Currently, no drug or surgery has been approved for the treatment of NAFLD, and weight loss is the only proven option for the management of NAFLD [4]. Recently, bioinformatics tools based on gene expression profiling via high-throughput microarray technology have been applied to elucidate the pathogenesis of a variety of diseases, including NAFLD [5][6][7][8]. However, most of the studies only focused on the screening of differentially expressed genes (DEGs) rather than the functional connections among these genes by gene expression pattern analysis [6].
NAFLD-related DEGs can be identified using traditional bioinformatics methods, including the Limma software package. However, these procedures could lead to neglect of the genes that have little difference in fold change but contribute to the pathogenesis of NAFLD. Given that differential gene expression levels may not reflect the complexity of NAFLD, we employed a new analytical method WGCNA (Weighted Gene Co-expression Network Analysis) to determine essential genes and signaling pathways involved in the pathogenesis of NAFLD.
WGCNA uses data from thousands of the most varied genes or all the genes to identify gene sets of interest compared to genes that are only concerned with differential expression while analyzing significant associations with phenotypes. WGCNA has the following two advantages: one is to lose fewer genes; the other is to collect a large number of genes into several gene sets and associate them with phenotypes without multiple hypothesis tests [9,10]. Therefore, we used WGCNA to analyze the gene expression synthesis database of gene expression data set GSE126848 to identify hub genes and molecular pathways involved in NAFLD.

Data Collection.
Data of GSE126848 based on the GPL18573 platform (Illumina NextSeq 500 (Homo sapiens)) were obtained from Gene Expression Omnibus (GEO) database of NCBI (https://www.ncbi.nlm.nih.gov/geo/query/acc .cgi?acc=GSE126848). GSE126848 contained 57 liver biopsy samples from 14 normal healthy and 12 obese individuals as well as 15 NAFL and 16 NASH subjects. The dataset used a second-generation sequencing method to provide count data after the map. The workflow of this study involving coexpression network construction, hub genes identification, functional analysis, and validation is shown in Figure 1.
2.2. WGCNA. The DESeq2 method was utilized to standardize the library size of the count data and normalize the z -score. Coexpression network analysis was performed by using the WGCNA software package. The assessment of comparability was performed based on the correlation of gene expression levels with connectivity. The softConnectivity function of the package with 5,000 randomly selected genes was employed to calculate the connectivity. Construction of a WGCNA was conducted using the package, and the flashClust function of package flashClust was applied for cluster analysis. The cutreeDynamic function of the R package was used to identify the modules consisted of groups of genes with a similar strength pattern of connections within the network as well as shared functions [11,12].
Module eigengenes (MEs) were identified using the ME function of the package, and signed module membership (MM) was utilized to determine the correlation of MEs with individual genes. Gene modules associated with specific clinical traits were identified based on the correlation between MEs and clinical traits. Gene significance (GS) was defined as the correlation between the gene expression level and a clinical trait. Clinical traits used in this study included three NAFLD-like phenotypes comprising two disease states, NAFL and NASH, as well as obese and healthy states. Sample cluster analysis based on gene expression level and bar plot of MEs in each module was performed to verify the relationship between modules and traits. Gene connectivity in each module was determined, and hub genes were identified as the genes with the highest connectivity.

Identification of Clinically Significant
Modules. WGCNA R package was employed to identify clinical trait-related gene modules and hub genes [13]. The adjacency matrix was converted to a topological overlap matrix (TOM). Genes were clustered into distinct modules based on the TOM-based dissimilarity measure. To uncover key modules, soft-thresholding power, cut height, and minimal module size were set as 9 (scale-free R 2 = 0:9), 0.25, and 10, respectively. Biological function of the modules significantly correlated with the clinical traits was analyzed by GO and KEGG analyses, and hub genes for those modules were screened. The thresholds for screening hub genes were GS > 0:58 and MM > 0:83.

GO and KEGG Enrichment Analysis.
To analyze functional annotation of the identified genes, all genes in the candidate modules were subjected to gene ontology (GO) functional annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses using the OmicShare tools [14,15]. Functional enrichment was performed in the following three GO categories: biological process (BP), molecular function (MF), and cellular component (CC).

PPI Network Construction, Module Analysis, and Hub
Gene Identification. Biological interactions within the candidate modules were analyzed by using STRING v11.0 (https://string-db.org/). The enrichment in the intersections for each module was calculated using built-in algorithms of the STRING v11.0 and displayed by Cytoscape [16]. The cytoHubba plug-in in Cytoscape was employed to identify hub genes in PPI network based on degree method [17]. The components in the central node could be considered as the core proteins and significant hub genes with essential physiological functions. The hub genes identified from both the coexpression and PPI networks were considered as "true" hub genes for further analysis [18].

Independent Dataset-Based Validation of Candidate Hub
Genes. An independent dataset GSE89632 was obtained from the GEO database (http://www.ncbi.nlm.nih.gov/geo/ query/acc.cgi?acc=GSE89632) and used to validate whether the three hub genes identified from dataset GSE126848 were correlated with NAFLD.
2.8. NAFLD Mouse Model. All animal experiments were approved by Animal Use and Care Committee of Guangdong Provincial Hospital of Traditional Chinese Medicine (Approval number 2020064). C57BL/6J mice (male, weight 22-24 g, eight weeks old) were purchased from the Laboratory Animal Services Center of Guangzhou University of Traditional Chinese Medicine. After one week of acclimatization, the mice were kept under normal conditions with free access to food and water [22,23]. The mice in model group were fed with high-fat diet (HFD, 16% kCal protein, 58% kCal fat, and 26% kCal carbohydrate) for 12 weeks. In the control group, mice were fed a standard diet (21% kCal protein, 11% kCal fat, and 68% kCal carbohydrate). The mice were sacrificed by the cervical dislocation method, and liver tissues were dissected.
2.9. Histological Analysis. For histological analysis, the liver samples were collected and fixed with 4% paraformaldehyde and then embedded with paraffin for hematoxylin and eosin (H&E) staining. For Oil Red O staining, the liver samples were embedded and cut into 10 μm thick sections. The sections were stained with Oil Red O (ORO) and counter-stained with hematoxylin and observed under fluorescence microscope.
2.10. Quantitative Real-Time PCR. Total RNA was isolated from the liver samples using TRIzol reagent (Invitrogen, USA), and cDNA synthesis was performed using the ImProm-IITM Reverse Transcription System (Promega, USA). The primers were synthesized by Takara (Dalian, China) with the sequences listed in Table 1  2.11. Statistical Analysis. IBM SPSS 25.0 and R (version 3.6.2) were utilized to perform statistical analyses. The normal distribution of the variables was checked by the Shapiro-Wilk test. All statistical tests were two-sided. The Student's t-test for independent samples or nonparametric Mann-Whitney U test was conducted to compare the differences between groups depending on the distribution pattern of the variables. P < 0:05 represented a significant difference.

Weighted Coexpression Network Construction and Key
Module Identification. A total of 126,135 genes generated from 57 samples of GSE126848 were used to construct a hierarchical clustering tree (Figure 2(a)). To determine the adjacency matrix, β = 9 was chosen with the scale-free topology criterion. In this case, the scale-free topology fit index reached values of 0.9 for the soft-thresholding power ( Figure 2(b)). Meanwhile, the MEDissThres was set as 0.25 for merging similar modules ( Figure 2(c)), and a total of 28 coexpression modules were constructed (Figure 2(d)). In addition, a gray module was used to collect genes not assigned to any modules and was excluded from further analyses. Notably, these modules were independent of other modules.

Module-Trait Correlations in NAFLD and Identification
of Hub Genes. We further analyzed the correlations of the various gene modules with distinct NAFLD status and found that the steelblue module was most positively correlated with NAFLD (correlation coefficient = 0:8, P value = 3E − 07) (Figure 2(e)). Based on the above findings, the steelblue module was identified as the most significant module related to NAFLD, and subsequent analyses were applied to the extracted genes from this module.

Functional and Pathway Enrichment
Analysis. GO and KEGG analysis of the genes within the steelblue module revealed the following enriched BPs (Figure 3(a)): ribosome biogenesis, negative regulation of cell cycle phase transition, assembly of mitochondrial respiratory chain complexes, mitochondrial localization of proteins, oxidative phosphory-lation, negative regulation of G2/M cell cycle transition, and establishment of protein localization to the mitochondrion. Moreover, the CCs were mainly enriched in the ribosome, mitochondrial protein complex, methyltransferase complex, condensed chromosome, centromeric region, and condensed chromosome kinetochore (Figure 3(b)), while enriched MFs were predominantly involved in structural constituents of the ribosome, drug transmembrane transporter activity, chaperone binding, threonine-type peptidase activity, threonine-type endopeptidase activity, and sulfur compound transmembrane transporter activity (Figure 3(c)). KEGG analysis identified thermogenesis as the most enriched pathway, followed by the ribosome and NAFLD as well as oxidative phosphorylation. Taken together, these data suggested a strong correlation between the genes in the module and the mitochondrial protein complex (Figure 3(d)).

Common Hub
Genes of the Steelblue Module Identified by WGCNA and Cytoscape. As summarized in Table S1, the steelblue module was comprised of 176 genes. Among them, 44 hub genes were identified based on the set thresholds as MM > 0:83 and GS > 0:58 (Table S2). Furthermore, all 176 genes were imported into the STRING database to construct PPI network and visualized with Cytoscape ( Figure 4(a)). Then, plug-in cytoHubba in Cytoscape was performed, and 30 hub genes were identified by degree method (Figure 4(b) and Table S3). Of 30 hub genes, 15 were derived from both the coexpression and PPI networks and considered "true" hub genes (Figure 4(c), Table 2).
3.5. Reassessment of "True" Hub Genes Using KEGG Analysis. KEGG pathway enhancement was reevaluated via DAVID to identify the pathways of the 15 selected DEGs. Of the 15 genes, UQCRQ, NDUFA9, and NDUFB8 were found to be markedly enriched in NAFLD as well as oxidative phosphorylation (Table 3).
3.6. Validation of Hub Genes Based on Dataset. The above three candidate hub genes were selected for validation in dataset GSE89632. As shown in Table 4, NDUFA9 and UQCRQ levels decreased in the NAFLD group compared to control group, but there was no significant difference in

Validation of NDUFA9 and UQCRQ in NAFLD Mouse
Model. To validate the role of NDUFA9 and UQCRQ in NAFLD, we established NAFLD mouse model. HE staining of liver tissue section showed that the structure of the hepatic lobules in control group was complete and clear, the boundary of the portal area was clear, the structure of the hepatocytes was normal, the hepatocytes were distributed radially around the central vein, and the liver sinusoids were visible ( Figure 5(a)). In contrast, the hepatocytes of model group were swollen, some hepatocytes showed ballooning degeneration, the nucleus was squeezed to one side, most of the hepatocytes contained lipid droplet vacuoles of varying sizes and numbers, and the hepatic cords were arranged disorderly ( Figure 5(b)). Oil red O staining of liver tissue showed no red-stained lipid droplets in the livers in the control group ( Figure 5(c)). In contrast, the hepatocytes in the model group were diffuse, granular, and fused into flaky red-stained lipid droplets ( Figure 5(d)). Next, we detected the expression of three candidate hub genes in the livers of NAFLD and control groups. We found significantly decreased expression of NDUFA9 and UQCRQ in the NAFLD model group compared to the control group, but no significant difference in the expression of NDUFB8 between NAFLD and control groups ( Figure 6).

Discussion
In recent years, a number of candidate causal genes for NAFLD, including PNPLA3, PPP1R3B, SAMM50, and TRIB1, have been identified based on the genome-wide association studies [24][25][26]. However, the exact mechanisms underlying the development of NAFLD remain to be determined. Most studies focused on the comparison of the gene expression between NAFLD patients and normal individuals. For example, Hoang et al. [27] and Frades et al. [7] employed DEG-seq to analyze the differentially expressed genes but did not perform dataset or laboratory verification. In contrast, in this study, we verified the selected important hub genes not only in an independent dataset but also in a high-fat feed-induced NAFLD mouse model.
To determine key genes and pathways in the pathogenesis of NAFLD, we performed WGCNA and identified fifteen "true" hub genes that are shared in both the coexpression and PPI networks. Furthermore, KEGG pathway analysis of these "true" hub genes revealed that three genes NDUFB8, NDUFA9, and UQCRQ were enriched in NAFLD and oxidative phosphorylation highly related to NAFLD pathogenesis. Therefore, we speculated that these three genes may be the key genes involved in the pathogenesis of NAFLD. In another independent dataset GSE89632, although the difference in logFC values between NDUFA9 and UQCRQ was weak, the downward trend was obvious. qRT-PCR assay verified that NDUFA9 and UQCRQ were the hub genes. Furthermore, we demonstrated significantly decreased expression of NDUFA9 and UQCRQ in NAFLD mouse model compared to the control. NDUFA9 (NADH: ubiquinone oxidoreductase subunit A9) gene encodes a subunit of the mitochondrial membrane respiratory chain NADH dehydrogenase (complex I). Instead of being implicated in the catalysis, NDUFA9 is required for proper assembly of the complex I [28]. Wang et al. [29] reported that metformin protected the mitochondrial structure of mouse colorectal epithelial cells by   14 BioMed Research International increasing NDUFA9 expression to inhibit colitis and colitisrelated cancers. Another study [ 30] showed that pioglitazone could bind to the complex I subunits NDUFA9, NDUFB6, and NDUFA6, inducing the complex disassembly, decreasing its activity, and promoting the expression of nuclear DNA-encoded subunits of the complex I in mice and HepG2 cells. Houtkooper et al. [31] reported that NR supplementation in mammalian cells and mouse tissues led to an increase in NAD(+) levels as well as the activation of SIRT1 and SIRT3, facilitating oxidative metabolism and protecting against metabolic abnormalities induced by a highfat diet. These data suggest that NDUFA9 may be involved in NAFLD by regulating the activity of complex I subunits. KEGG analysis revealed that the three hub genes identified in this study are predominantly enriched in the oxidative phosphorylation pathway. Oxidative phosphorylation represents a process by which the energy of adenosine triphosphate (ATP) can be efficiently produced. Oxidative phosphorylation was decreased in individuals with NAFLD [32]. Moreover, HFD-fed mice showed reduced activities of oxidative phosphorylation enzymes, which can be attributed to decreased amount of fully assembled complexes [33]. Oxidative phosphorylation dysfunction has been shown to cause oxidative stress involving cytochrome P4502E1, xanthine oxidase, NADPH oxidase, and liver mitochondria [34][35][36].
UQCRQ (ubiquinol-cytochrome c reductase, complex III subunit VII) is a gene encoding a low-molecular mass ubiquinone-binding protein. UQCRQ is identified as a small core-associated protein and a subunit of ubiquinolcytochrome c reductase complex III [37]. Although no report showed that UQCRQ is directly related to NAFLD, UQCRQ is associated with oxidative phosphorylation pathway [38]. Based on the relationship between oxidative phosphorylation and NAFLD, we speculate that UQCRQ is linked to NAFLD and may be a key gene in the pathogenesis of NAFLD.
In this study, we identified the key regulatory genes and pathways involved in NAFLD using the integrated method of bioinformatics, including WGCNA, functional genomics, and gene regulatory network. We found that key genes and pathways such as NDUFA9, UQCRQ, and ribosome, as well as proteasome and oxidative phosphorylation are essential to the pathogenesis of NAFLD. The present study has the following limitations: First, our data obtained from the WGCNA remain to be validated in an independent cohort due to the single platform of dataset used in the study. Second, the absence of clinical traits such as serum biomarker profiling, liver biopsy histology for scoring hepatic steatosis, and fibrosis in the raw data might affect the assessment of NAFLD phenotypes. Third, the use of mouse rather than human samples may restrict the capacity to detect the differences in gene expression in NAFLD patients.
In conclusion, based on WGCNA analyses, we identified 15 NAFLD-related candidate hub genes in the steelblue module. Moreover, we found that the proteasome, oxidative phosphorylation, NAFLD, and ribosome may be involved in the pathogenesis of NAFLD. Importantly, we established mouse model of NAFLD and verified two hub genes NDUFA9 and UQCRQ, which may act as biomarkers and therapeutic targets for NAFLD.

Data Availability
All data are available upon request.